DETAILED ACTION
Specification 

1. The amendment to the Title overcomes the objections made in the previous
Office Action. The objections to the title are withdrawn. 

Drawings 

2. The amendments to the drawings overcome the objections made in the previous 
Office Action. The objections to the drawings are withdrawn. 

Response to Arguments 

3. Applicant's arguments filed June 9, 2005 have been fully considered but they are 
not persuasive. 

The Applicant has argued that the present application is allowable over Basu et
al. (U.S. Patent 6,594,629) because, while the Applicant claims performing speech
detection only in the case where (1) an acoustic detector detects a finite, nonzero
acoustic response, and (2) a visual detector detects at least one facial characteristic
associated with speech utterances of the speaker, Basu et al. discloses additional
embodiments (i.e. only (1), only (2), both (1) and (2), or any combination thereof). The
fact that Basu et al. discloses additional embodiments does not mitigate the fact that the
Applicant's preferred embodiment (both (1) and (2) must be satisfied) is still disclosed
by Basu et al. That is, in the case where the one approach of using information from
both paths simultaneously is used, the speech detection is performed only when (1) and
(2) are satisfied. Therefore, Basu et al. anticipates claim 1.

Claim 2 amounts to essentially a restatement of the "only" limitation in claim 1, in
that the speech recognizer is not responsive when only (1) is satisfied, only (2) is
satisfied, or neither (1) nor (2) is satisfied. This is simply a logical restatement of only
activating the speech recognizer when (1) and (2) are satisfied. Again, although Basu
et al. discloses additional embodiments, the embodiment of using information from both
paths simultaneously anticipates claim 1, and inherently anticipates not activating the
speech recognizer when that condition is not met, as in claim 2.
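
By way of illustration only, the logical relationship described above can be sketched in a few lines of Python. The function name `recognizer_enabled` and its boolean inputs are hypothetical; they appear neither in the claims nor in Basu et al., and the sketch assumes only that each detector produces a yes/no decision.

```python
from itertools import product


def recognizer_enabled(acoustic_detected: bool, visual_detected: bool) -> bool:
    """Hypothetical gate: enable recognition only when (1) and (2) both hold."""
    return acoustic_detected and visual_detected


# Enumerate all four combinations of the two detector outputs.
table = {
    (a, v): recognizer_enabled(a, v)
    for a, v in product([False, True], repeat=2)
}

# Every combination other than (True, True) leaves the recognizer off:
# only (1), only (2), and neither (1) nor (2).
disabled_cases = [case for case, enabled in table.items() if not enabled]
```

Under this assumption, enabling the recognizer only for the single case where both conditions hold is equivalent to leaving it disabled in the remaining three cases.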

See MPEP 2123. 

4. Furthermore, with regard to the use of official notice in the rejections of claims 5
and 11, it is noted that the applicant has not made any attempt to traverse the assertion
of official notice; therefore, the well-known in the art statement is taken to be admitted
prior art (see MPEP 2144.03).

5. For the reasons given above, the rejections made in the previous Office Action 
stand. 

Claim Rejections - 35 USC § 102 

6. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that 
form the basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless -

(e) the invention was described in (1) an application for patent, published under section 122(b), by
another filed in the United States before the invention by the applicant for patent or (2) a patent 
granted on an application for patent by another filed in the United States before the invention by the 
applicant for patent, except that an international application filed under the treaty defined in section 
351(a) shall have the effects for purposes of this subsection of an application filed in the United States 
only if the international application designated the United States and was published under Article 21(2) 
of such treaty in the English language. 

7. Claims 1-4, 6-10, 12-13, and 18-24 are rejected under 35 U.S.C. 102(e) as being 
anticipated by Basu et al. (U.S. Patent 6,594,629). 

In regard to claim 1, Basu et al. disclose a speech recognition system (Fig. 1) 
comprising: 

an acoustic detector for detecting speech utterances of a speaker (event 
detection module 28 performing audio event detection, column 15, line 66 to column 16, 
line 4); 

a visual detector for detecting at least one facial characteristic associated with 
speech utterances of the speaker (event detection module performing mouth opening 
detection, column 15, lines 41-42);

a processing arrangement connected to be responsive to the acoustic and visual 
detectors for deriving a signal having first and second values respectively indicative of 
the speaker making and not making speech utterances such that the first value is 
derived in response to the acoustic detector detecting a finite, nonzero acoustic 
response while the visual detector detects at least one facial characteristic associated 
with speech utterances of the speaker (information from the audio event detection and 
the facial feature detection are used by event detection module 28 to turn search 
module 34 on or off, column 15, lines 19-22 and lines 59-65, column 16, lines 53-56; the 


Application/Control Number: 10/058,730 Page 5 

Art Unit: 2655 

event detection module 28 inherently only indicates the user is making speech if both 
the audio and visual detectors detect speech events); and 

a speech recognizer for deriving an output indicative of the speech utterances as 
detected by the acoustic detector, the speech recognizer being connected to be 
responsive to the acoustic detector while the signal has the first value (search module 
34 is turned on when a speech event is detected by event detection module 28, column 
15, lines 59-61). 

In regard to claim 2, Basu et al. disclose search module 34 is turned on when a 
speech event is detected (column 15, lines 59-61). The event detection is a 
combination of the detection of a visual speech event (mouth opening) as well as the 
detection of an audio event (column 16, lines 53-56). The signal output by search 
module 34, therefore, is the second value (off) in response to any of: 

(a) the acoustic detector not detecting a finite, nonzero acoustic response while
the visual detector does not detect speech utterances of the speaker,

(b) the acoustic detector detecting a finite, nonzero acoustic response while the
visual detector does not detect speech utterances of the speaker, and

(c) the acoustic detector not detecting a finite, nonzero acoustic response while 
the visual detector detects speech utterances of the speaker. 

In regard to claims 3 and 6, Basu et al. disclose the processing arrangement 
includes a delay element (buffer) for assuring that the beginning of each speech 
utterance is coupled to the speech recognizer (speech is collected in the buffer, so that
the speech can be sent for recognition if it is determined to be speech, column 15, lines 
41-42 and lines 50-53). 

In regard to claims 4 and 7, Basu et al. discloses the delay arrangement includes 
a memory element connected to be responsive to the acoustic detector (the buffer 
collects speech data), the memory element including a plurality of stages for storing 
sequential segments of the output of the acoustic detector, the delay arrangement being 
such that the contents of the memory element stage storing the beginning of a speech 
utterance are initially coupled to the speech recognizer (speech is collected in the 
buffer, so that the speech can be sent for recognition if it is determined to be speech, 
column 15, lines 41-42 and lines 50-53). 

A buffer inherently collects sequential elements in a plurality of stages to ensure 
that the beginning of the elements are initially passed to the next stage. The buffer 
disclosed by Basu et al., therefore, inherently includes a plurality of stages for storing
sequential segments of the output of the acoustic detector, the delay arrangement being 
such that the contents of the memory element stage storing the beginning of a speech 
utterance are initially coupled to the speech recognizer. 
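
For illustration only, the staged buffering described above might be sketched as follows. This is not the implementation of Basu et al.; the function `gate_frames`, the `stages` parameter, and the use of one boolean flag per segment are all assumptions made for the example.

```python
from collections import deque


def gate_frames(frames, speech_flags, stages=3):
    """Yield segments for recognition, oldest buffered segment first.

    frames:       iterable of audio segments (any objects).
    speech_flags: iterable of bools, True while a speech event is detected.
    stages:       number of buffer stages (hypothetical parameter).
    """
    delay = deque(maxlen=stages)  # the oldest segment is dropped when full
    for frame, is_speech in zip(frames, speech_flags):
        delay.append(frame)
        if is_speech:
            # Flush in arrival order so the beginning of the utterance
            # reaches the recognizer before later segments.
            while delay:
                yield delay.popleft()
        # While no speech event is detected, segments remain in the
        # buffer and are not forwarded.


frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
flags = [False, False, True, True, False, False]
forwarded = list(gate_frames(frames, flags, stages=2))
```

In this sketch the segment stored just before the speech event is forwarded first, and segments arriving after the speech event ends are held back rather than passed on.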

In regard to claims 8 and 9, Basu et al. disclose the delay arrangement (buffer) is 
arranged for assuring that upon the completion of each speech utterance the acoustic 
detector is decoupled from the speech recognizer (recognition is performed for each
piece of buffered data, until no speech is uttered, column 15, lines 53-55). 

In regard to claim 10, Basu et al. discloses the delay arrangement (buffer) 
includes a memory element connected to be responsive to the acoustic detector, the 
memory element including a plurality of stages for storing sequential segments of the 
output of the acoustic detector, the delay arrangement being such that the contents of 
the memory element stage storing acoustic energy associated with the acoustic 
detector and which occurs upon completion of each speech utterance is prevented from 
being coupled to the speech recognizer (recognition is performed for each piece of 
buffered data, until no speech is uttered, column 15, lines 53-55). 

A buffer inherently collects sequential elements in a plurality of stages to ensure 
that the beginning of the elements are initially passed to the next stage. The buffer 
disclosed by Basu et al., therefore, inherently includes a plurality of stages for storing 
sequential segments of the output of the acoustic detector, the delay arrangement being 
such that the contents of the memory element stage storing acoustic energy associated 
with the acoustic detector and which occurs upon completion of each speech utterance 
is prevented from being coupled to the speech recognizer. 

In regard to claims 12 and 13, Basu et al. disclose the processing arrangement 
includes a face recognizer arranged for enabling the signal to have the first value in 
response to the speaker being at a predetermined orientation relative to the visual 
detector connected to be responsive to the visual detector (a face is identified as
'frontal' facing by frontal pose detector 20, column 7, lines 32-34 and column 15, lines 
37-38). 

In regard to claim 18, Basu et al. disclose a method of recognizing speech 
utterances of a speaker with an automatic speech recognizer responsive to acoustic
speech utterances of the speaker comprising: 

detecting acoustic energy having a spectrum associated with speech utterances 
(event detection module 28 performing audio event detection, column 15, line 66 to 
column 16, line 4), 

detecting at least one facial characteristic associated with speech utterances of 
the speaker (event detection module performing mouth opening detection, column 15, 
lines 41-42), and 

activating the automatic speech recognizer in response to the detected acoustic 
energy having a spectrum associated with speech utterances while the at least one 
facial characteristic associated with speech utterances of the speaker is occurring 
(information from the audio event detection and the facial feature detection are used by 
event detection module 28 to turn search module 34 on or off, column 15, lines 19-22 
and lines 59-65, column 16, lines 53-56; the event detection module 28 inherently only 
indicates the user is making speech if both the audio and visual detectors detect speech 
events). 

In regard to claim 19, Basu et al. disclose search module 34 is turned on when a
speech event is detected (column 15, lines 59-61). The event detection is a 
combination of the detection of a visual speech event (mouth opening) as well as the 
detection of an audio event (column 16, lines 53-56). The method disclosed by Basu et 
al. therefore comprises preventing activation of the automatic speech recognizer in 
response to any of: 

(a) no acoustic energy having a spectrum associated with speech utterances 
being detected while no facial characteristic associated with speech utterances of the 
speaker is detected, 

(b) acoustic energy having a spectrum associated with speech utterances being 
detected while no facial characteristic associated with speech utterances of the speaker 
is detected, and 

(c) no acoustic energy having a spectrum associated with speech utterances 
being detected while at least one facial characteristic associated with speech utterances 
of the speaker is detected.

In regard to claim 20, Basu et al. disclose assuring that the beginning of each 
speech utterance is coupled to the speech recognizer (with the buffer, column 15, lines 
41-42 and lines 50-53). 

In regard to claim 21, Basu et al. disclose the beginning of each speech
utterance is assuredly coupled to the speech recognizer by: 

(a) delaying the speech utterance (in the buffer, column 15, lines 41-42),

(b) recognizing the beginning of each speech utterance (with event detection 
module 28, using facial features and audio features, column 15, lines 41-42, column 15, 
line 66 to column 16, line 4, and column 16, lines 53-56), and 

(c) responding to the recognized beginning of each speech utterance to couple 
the delayed speech utterance associated with the beginning of each speech utterance 
to the speech recognizer and thereafter sequentially coupling the remaining delayed 
speech utterances to the speech recognizer (column 15, lines 50-53). 

In regard to claim 22, Basu et al. disclose assuring that no detected acoustic 
energy is coupled to the speech recognizer upon the completion of an utterance 
(recognition is performed for each piece of buffered data, until no speech is uttered, 
column 15, lines 53-55). 

In regard to claim 23, Basu et al. disclose assurance that no detected acoustic 
energy is coupled to the speech recognizer upon the completion of a speech utterance 
is provided by: 

(a) delaying the acoustic energy associated with the speech utterance (in the 
buffer, column 15, lines 41-42), 

(b) recognizing the completion of each speech utterance (no more speech event, 
column 15, lines 53-55 and 59-61), and 

(c) responding to the recognized completion of each speech utterance to
decouple delayed acoustic energy occurring after the completion of each speech
utterance from the speech recognizer (the process is only repeated until no more
speech events are detected, then search module 34 is turned off, column 15, lines
53-55 and 59-61).

In regard to claim 24, Basu et al. disclose the at least one facial characteristic 
indicates the face of the speaker has a predetermined orientation relative to a detector 
involved in the step of detecting the at least one facial characteristic (a face is identified 
as 'frontal' facing, column 15, lines 37-38).

Claim Rejections - 35 USC § 103 

8. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

9. Claims 5 and 11 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Basu et al. 

Basu et al. does not disclose that the buffer is a ring buffer. 
The Applicant's admitted prior art discloses the use of ring buffers when buffering
data because they are easy to implement and ensure the buffer stays at a fixed size. 

It would have been obvious to one of ordinary skill in the art at the time of
invention to modify Basu et al. so that the buffer was a ring buffer because ring buffers 
are easy to implement and ensure the buffer stays at a fixed size. 
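
As a minimal sketch of the fixed-size property relied upon above, a ring buffer can be written as shown below. The class name `RingBuffer` and its methods are hypothetical and are not taken from Basu et al. or from the application; the example only shows that storage never grows beyond its initial capacity.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest entry when full."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._data = [None] * capacity  # storage is allocated once
        self._start = 0                 # index of the oldest entry
        self._size = 0                  # number of valid entries

    def push(self, item) -> None:
        cap = len(self._data)
        end = (self._start + self._size) % cap
        self._data[end] = item
        if self._size < cap:
            self._size += 1
        else:
            # Buffer was full: the oldest entry was just overwritten.
            self._start = (self._start + 1) % cap

    def items(self) -> list:
        """Return the stored entries from oldest to newest."""
        cap = len(self._data)
        return [self._data[(self._start + i) % cap] for i in range(self._size)]


rb = RingBuffer(3)
for segment in range(5):
    rb.push(segment)
```

After five pushes into a three-entry buffer, only the three most recent segments remain and the underlying storage is unchanged in size.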

10. Claims 14-17 and 25-27 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Basu et al. (U.S. Patent 6,594,629, hereinafter Basu 1), in view of 
Basu et al. (U.S. Patent 6,219,640, hereinafter Basu 2). 

In regard to claim 14, Basu 1 discloses the system has applications in speaker 
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15).

Basu 1 does not explicitly disclose distinguishing the faces of a plurality of 
speakers and enabling the signal to have the first value in response to a speaker
having a recognized face. 

Basu 2 discloses a system for identifying a speaker that includes a face 
recognizer (Fig. 1, face recognition 24) that distinguishes from a plurality of speakers 
(identifies the person speaking, column 6, lines 46-50). 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to identify who was speaking using a face recognizer and 
enable the speech recognizer in response to a speaker having a recognized face,
in order to get accurate recognition results when the user was in a crowd. 

In regard to claim 15, Basu 1 discloses the system has applications in speaker
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not explicitly disclose including a speaker identity recognizer to be 
responsive to the acoustic detector, to distinguish speech patterns of a plurality of 
speakers and enable the signal to have the first value in response to the speaker having 
a recognized speech pattern. 

Basu 2 discloses a speaker identity recognizer (speaker recognizer 16) arranged
for distinguishing speech patterns of a plurality of speakers (column 5, lines 29-34). 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to identify who was speaking using a speaker recognizer and
enable the speech recognizer in response to the speaker's voice being recognized, in
order to provide a second means for confirming the identity of the speaker and provide a
backup if the facial recognizer had difficulty identifying the user.

In regard to claims 16 and 17, Basu 1 discloses the system has applications in 
speaker detection in an audience, as well as speaker recognition (identifying who is 
speaking), and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not disclose the processing arrangement is arranged for causing the 
signal to have the first value in response to the speaker having a recognized face 
matched with a recognized speech pattern of the same speaker. 


Basu 2 discloses a speaker is identified when a face is matched with a 
recognized speech pattern of the same speaker (joint identification module 30, column 
8, lines 43-47).

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to enable the speech recognizer when the recognized face 
matched the recognized speech pattern of the same speaker, in order to improve the 
speaker recognition accuracy, as taught by Basu 2 (column 3, lines 32-36).
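
The proposed combination can be sketched, for illustration only, as a gate that requires the face recognizer and the speaker recognizer to agree on the same identity. The functions `joint_identity` and `enable_recognizer` are hypothetical, and the sketch reduces joint identification to a simple equality check on identity labels, which may differ from how joint identification module 30 of Basu 2 combines its inputs.

```python
from typing import Optional


def joint_identity(face_id: Optional[str], voice_id: Optional[str]) -> Optional[str]:
    """Return the identity only when both recognizers name the same speaker."""
    if face_id is not None and face_id == voice_id:
        return face_id
    return None


def enable_recognizer(speech_event: bool,
                      face_id: Optional[str],
                      voice_id: Optional[str]) -> bool:
    """Hypothetical gate: a speech event plus a matched face and voice."""
    return speech_event and joint_identity(face_id, voice_id) is not None


matched = enable_recognizer(True, "speaker_a", "speaker_a")
mismatched = enable_recognizer(True, "speaker_a", "speaker_b")
```

Under these assumptions the speech recognizer is enabled only when a speech event is detected and the recognized face and recognized speech pattern belong to the same speaker.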

In regard to claims 25 and 26, Basu 1 discloses the method has applications in speaker
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not explicitly disclose distinguishing the faces of a plurality of 
speakers and enabling the signal to have the first value in response to a speaker
having a recognized face. 

Basu 2 discloses a method for identifying a speaker that includes a face 
recognizer (Fig. 1, face recognition 24) that distinguishes from a plurality of speakers
(identifies the person speaking, column 6, lines 46-50). 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to identify who was speaking using a face recognizer and 
enable the speech recognizer in response to a speaker having a recognized face,
in order to get accurate recognition results when the user was in a crowd. 

In regard to claim 27, Basu 1 discloses the method has applications in speaker
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not disclose storing images of the faces and speech patterns during 
a training period and comparing the stored images and speech patterns with images of 
the face of the speaker and the speech pattern of the speaker. 

Basu 2 discloses storing:

(1) images of the faces of a plurality of speakers (column 7, lines 27-28), and

(2) the speech patterns of the same plurality of speakers during at least one 
training period (column 6, lines 17-20, see also column 14, lines 10-12); and 

performing the distinguishing steps by comparing the stored images and speech 
patterns with images of the face of the speaker and the speech pattern of the speaker 
(joint identification, column 8, lines 43-47).

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to store images of the faces of a plurality of speakers and 
the speech patterns of the same plurality of speakers during a training period, since 
these are necessary to later identify the users. Furthermore, it also would have been 
obvious to one of ordinary skill in the art at the time of invention to enable the speech 
recognizer when the recognized face matched the recognized speech pattern of the 
same speaker, in order to improve the speaker recognition accuracy, as taught by Basu 
2 (column 3, lines 32-36). 

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time
policy as set forth in 37 CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the 
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Brian L. Albertalli whose telephone number is (571) 272-7616.
The examiner can normally be reached on Mon - Fri, 8:00 AM - 5:30 PM, every
second Fri off.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Wayne Young, can be reached on (571) 272-7582. The fax phone number
for the organization where this application or proceeding is assigned is 703-872-9306. 

Information regarding the status of an application may be obtained from the
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free).

BLA 7/19/05



